NSF PAR Search | NSF Public Access Repository

ARIES: An Agile MLIR-Based Compilation Flow for Reconfigurable Devices with AI Engines

https://doi.org/10.1145/3706628.3708870

Zhuang, Jinming; Xiang, Shaojie; Chen, Hongzheng; Zhang, Niansong; Yang, Zhuoping; Mao, Tony; Zhang, Zhiru; Zhou, Peipei (February 2025, ACM)

As AI continues to grow, modern applications are becoming more data- and compute-intensive, driving the development of specialized AI chips to meet these demands. One example is AMD's AI Engine (AIE), a dedicated hardware system that includes a 2D array of high-frequency very long instruction word (VLIW) vector processors to provide high computational throughput and reconfigurability. However, AIE's specialized architecture presents tremendous challenges in programming and compiler optimization. Existing AIE programming frameworks lack a clean abstraction to represent multi-level parallelism in AIE; programmers have to identify the parallelism within a kernel, partition it manually, and assign sub-tasks to different AIE cores to exploit that parallelism. This manual effort significantly lowers programming productivity. Furthermore, some AIE architectures include FPGAs to provide extra flexibility, but there is no unified intermediate representation (IR) that captures these architectural differences. As a result, existing compilers can only optimize the AIE portions of the code, overlooking potential FPGA bottlenecks and leading to suboptimal performance. To address these limitations, we introduce ARIES, an agile multi-level intermediate representation (MLIR) based compilation flow for reconfigurable devices with AIEs. ARIES introduces a novel programming model that allows users to map kernels to separate AIE cores, exploiting task- and tile-level parallelism without restructuring code. It also includes a declarative scheduling interface to explore instruction-level parallelism within each core. At the IR level, we propose a unified MLIR-based representation for AIE architectures, both with and without FPGAs, facilitating holistic optimization and better portability across AIE device families.
For the General Matrix Multiply (GEMM) benchmark, ARIES achieves 4.92 TFLOPS, 15.86 TOPS, and 45.94 TOPS throughput under the FP32, INT16, and INT8 data types, respectively, on Versal VCK190. Compared with the state-of-the-art (SOTA) work CHARM for AIE, ARIES improves the throughput by 1.17x, 1.59x, and 1.47x, respectively. For a ResNet residual layer, ARIES achieves up to 22.58x speedup compared with the optimized SOTA work Riallto on the Ryzen-AI NPU. ARIES is open-sourced on GitHub: https://github.com/arc-research-lab/Aries.
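
The manual partitioning burden this abstract describes can be illustrated with a minimal sketch. It is plain Python, not ARIES or AIE tool code: the function names (`partition`, `tiled_matmul`), the grid shape, and the nested-list matrices are all hypothetical and chosen only for illustration. It shows what "partition the kernel and assign sub-tasks to cores" means for GEMM: each core of an assumed 2D grid receives one output tile and computes it independently.

```python
# Illustrative only: plain Python, not ARIES or AIE tool code.
# All names and the grid shape below are hypothetical.

def matmul(a, b):
    """Reference GEMM on nested lists."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def partition(rows, cols, grid_rows, grid_cols):
    """Assign one output tile (row range, column range) to each core."""
    tile_r = -(-rows // grid_rows)  # ceiling division
    tile_c = -(-cols // grid_cols)
    tasks = {}
    for r in range(grid_rows):
        for c in range(grid_cols):
            r0, r1 = r * tile_r, min((r + 1) * tile_r, rows)
            c0, c1 = c * tile_c, min((c + 1) * tile_c, cols)
            if r0 < r1 and c0 < c1:  # skip cores left without work
                tasks[(r, c)] = (range(r0, r1), range(c0, c1))
    return tasks


def tiled_matmul(a, b, grid_rows, grid_cols):
    """Compute a @ b tile by tile, one tile per (hypothetical) core."""
    rows, cols, k = len(a), len(b[0]), len(b)
    out = [[0] * cols for _ in range(rows)]
    tasks = partition(rows, cols, grid_rows, grid_cols)
    for core, (row_range, col_range) in tasks.items():
        # Each core's tile depends on no other tile, so the iterations
        # of this loop could run in parallel on separate cores.
        for i in row_range:
            for j in col_range:
                out[i][j] = sum(a[i][p] * b[p][j] for p in range(k))
    return out
```

Calling `tiled_matmul(a, b, 2, 2)` gives the same result as `matmul(a, b)`; only the assignment of work to cores changes. This is the bookkeeping that, per the abstract, programmers otherwise have to write by hand.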
Free, publicly-accessible full text available February 27, 2026
Allo: A Programming Model for Composable Accelerator Design

https://doi.org/10.1145/3656401

Chen, Hongzheng; Zhang, Niansong; Xiang, Shaojie; Zeng, Zhichen; Dai, Mengjia; Zhang, Zhiru (June 2024, Proceedings of the ACM on Programming Languages)

Special-purpose hardware accelerators are increasingly pivotal for sustaining performance improvements in emerging applications, especially as the benefits of technology scaling continue to diminish. However, designers currently lack effective tools and methodologies to construct complex, high-performance accelerator architectures in a productive manner. Existing high-level synthesis (HLS) tools often require intrusive source-level changes to attain satisfactory quality of results. Despite the introduction of several new accelerator design languages (ADLs) aiming to enhance or replace HLS, their advantages are more evident in relatively simple applications with a single kernel. Existing ADLs prove less effective for realistic hierarchical designs with multiple kernels, even if the design hierarchy is flattened. In this paper, we introduce Allo, a composable programming model for efficient spatial accelerator design. Allo decouples hardware customizations, including compute, memory, communication, and data type, from the algorithm specification, and encapsulates them as a set of customization primitives. Allo preserves the hierarchical structure of an input program by combining customizations from different functions in a bottom-up, type-safe manner. This approach facilitates holistic optimizations that span function boundaries. We conduct comprehensive experiments on commonly used HLS benchmarks and several realistic deep learning models. Our evaluation shows that Allo can outperform state-of-the-art HLS tools and ADLs on all test cases in PolyBench. For the GPT2 model, the inference latency of the Allo-generated accelerator is 1.7x faster than the NVIDIA A100 GPU with 5.4x higher energy efficiency, demonstrating the capability of Allo to handle large-scale designs.
Full Text Available
Formal Verification of Source-to-Source Transformations for HLS

https://doi.org/10.1145/3626202.3637563

Pouchet, Louis-Noël; Tucker, Emily; Zhang, Niansong; Chen, Hongzheng; Pal, Debjit; Rodríguez, Gabriel; Zhang, Zhiru (March 2024, International Symposium on Field Programmable Gate Arrays (FPGA'2024))
Supporting a Virtual Vector Instruction Set on a Commercial Compute-in-SRAM Accelerator

https://doi.org/10.1109/LCA.2023.3341389

Golden, Courtney; Ilan, Dan; Huang, Caroline; Zhang, Niansong; Zhang, Zhiru; Batten, Christopher (January 2023, IEEE Computer Architecture Letters)

Full Text Available
HeteroFlow: An Accelerator Programming Model with Decoupled Data Placement for Software-Defined FPGAs

https://doi.org/10.1145/3490422.3502369

Xiang, Shaojie; Lai, Yi-Hsiang; Zhou, Yuan; Chen, Hongzheng; Zhang, Niansong; Pal, Debjit; Zhang, Zhiru (February 2022, ACM/SIGDA International Symposium on Field-Programmable Gate Arrays)

Full Text Available
Accelerator design with decoupled hardware customizations: benefits and challenges: invited

https://doi.org/10.1145/3489517.3530681

Pal, Debjit; Lai, Yi-Hsiang; Xiang, Shaojie; Zhang, Niansong; Chen, Hongzheng; Casas, Jeremy; Cocchini, Pasquale; Yang, Zhenkun; Yang, Jin; Pouchet, Louis-Noël; et al (July 2022, ACM/IEEE Design Automation Conference (DAC))

Full Text Available
